library(xgboost)
source("scripts/error_analysis.r")
source("scripts/training_fcts.r")

Boosting

Previously we trained a random forest model, which did okay. This time we’re going to fast-forward two decades and use XGBoost (eXtreme Gradient Boosting). While it was surely named by a 12-year-old at heart, this model has also been known as the “GBM (Gradient Boosting Machine) Killer”, since it has performed Xtremely competitively. The details of the algorithm are rather involved, but in essence boosting is another ensemble technique.

While random forest uses bagging (bootstrap aggregation), randomly sampling the input with replacement and averaging the results to reduce overfitting, boosting trains inherently weak learners in sequence, each one correcting the errors of the last, to form a more powerful model. Other algorithms in this family include AdaBoost (adaptive boosting), gradient boosting, and CatBoost.
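The core idea can be sketched in a few lines of base R. This is a toy illustration rather than XGBoost's actual algorithm: each "weak learner" here is just the mean of the current residuals, standing in for a small regression tree, and every round nudges the running prediction toward the target.

```r
# Minimal boosting sketch with squared-error loss (toy, not XGBoost).
set.seed(42)
y = rnorm(100, mean = 3.7, sd = 1)  # hypothetical star ratings
pred = rep(0, length(y))            # start from a constant model
eta = 0.3                           # learning rate, like XGBoost's eta
for (round in 1:50) {
  residuals = y - pred
  weak_fit = mean(residuals)        # stand-in for fitting a small tree
  pred = pred + eta * weak_fit      # add the weak learner, shrunk by eta
}
sqrt(mean((y - pred)^2))            # RMSE shrinks toward the noise level
```

After enough rounds the additive model converges to the best constant fit (the mean of `y`); with real trees as weak learners, each round instead fits the structure left in the residuals.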

We use XGBoost to skip straight to the head honcho of boosting. It’s not the latest and most cutting-edge, but it’s a proven leader. And unlike AdaBoost and some others, XGBoost handles multiclass classification and regression natively.

One-Hot vs. PCA vs. W2V

Real quick, let’s figure out which feature set to use with a baseline model. I know this isn’t the most foolproof approach, but it’ll save some time for now and give a general indication.

# This stuff won't change between trials.
users = trainfuncs$read.csv('users_clean.csv')
train = trainfuncs$read.csv('reviews_simplified.csv')
train = train[, c("uid", "bid", "stars")]
val = trainfuncs$read.csv('validate_simplified.csv')
test = trainfuncs$read.csv('test_simplified.csv')

# This stuff will. (load_data assigns with <<- so the helpers below
# can share these globals.)
load_data = function(business_set) {
  business_file <<- paste('business_', business_set, '.csv', sep='')
  business <<- trainfuncs$read.csv(business_file)
  combined <<- create_dmatrix(rbind(train, val), split=TRUE)
  watchlist <<- create_watch()
  dtest <<- create_dmatrix(test, split=FALSE)
}

# Convert data.frame to XGBoost's dense matrix format.
create_dmatrix = function(df, split=TRUE) {
  df = trainfuncs$join_sets(df, split)
  if (split) {
    xgb.DMatrix(as.matrix(df$X), label=df$y)  
  } else {
    xgb.DMatrix(as.matrix(df)) 
  }
}

create_watch = function() {
  dtrain = create_dmatrix(train)
  dval = create_dmatrix(val)
  
  # Return watchlist
  list(train=dtrain, eval=dval)
}

Let’s train an XGBoost model on each set and see which one works best as a first pass.

trainfuncs$set_alg_name("xgboost")
default_params = list(eta = 0.3,               # learning rate
                      gamma = 0.5,             # min loss reduction to split
                      max_depth = 5,           # max tree depth
                      min_child_weight = 5,    # min sum of instance weight per child
                      subsample = 0.8,         # row sampling ratio per round
                      colsample_bytree = 0.8,  # column sampling ratio per tree
                      lambda = 1, alpha = 0.2, # L2 / L1 regularization
                      scale_pos_weight = 1,    # class balance (binary tasks)
                      max_delta_step = 0,      # cap on leaf output (0 = no cap)
                      objective = "reg:linear") # squared-error regression

params = trainfuncs$get_params()

train_fct = xgb.train
predict_fct = predict
train_args = list(nrounds = 1000,
                  early_stopping_rounds = 25,
                  metrics = list("rmse"),
                  verbose = 1,
                  print_every_n = 10)

# Assumes watchlist has already been constructed.
train_ = function(train_args, params) {
  if (params$final) {
    train = combined
  } else {
    train = watchlist$train
  }
  train_args = c(data=train, watchlist=list(watchlist),
                 train_args)
  params$args = list(params=params$args)
  trainfuncs$train(train_args, params)
}
# params is required: predict_ branches on params$final.
predict_ = function(model, params) {
  if (params$final) {
    test = dtest
  } else {
    test = watchlist$eval
  }
  trainfuncs$predict(model, newdata=test, 
                     labels=getinfo(test, "label"),
                     params=params)
}
train.predict = function(train_args, params) {
  model = train_(train_args, params)
  predict_(model, params)
}
# Train model for label encoding.
load_data('labels')
model = train_(train_args, params)
## [1]  train-rmse:2.527726 eval-rmse:2.629131 
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
## 
## [11] train-rmse:0.998421 eval-rmse:1.045882 
## [21] train-rmse:0.990811 eval-rmse:1.045447 
## [31] train-rmse:0.987147 eval-rmse:1.046156 
## Stopping. Best iteration:
## [14] train-rmse:0.994162 eval-rmse:1.044463
# Train model for one-hot.
load_data('onehot')
model = train_(train_args, params)
## [1]  train-rmse:2.527970 eval-rmse:2.630028 
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
## 
## [11] train-rmse:1.000323 eval-rmse:1.052716 
## [21] train-rmse:0.990041 eval-rmse:1.045635 
## [31] train-rmse:0.986715 eval-rmse:1.046229 
## [41] train-rmse:0.984178 eval-rmse:1.046446 
## Stopping. Best iteration:
## [23] train-rmse:0.989308 eval-rmse:1.045371
# Train model for PCA.
load_data('pca')
model = train_(train_args, params)
## [1]  train-rmse:2.539426 eval-rmse:2.655265 
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
## 
## [11] train-rmse:0.995908 eval-rmse:1.049882 
## [21] train-rmse:0.985818 eval-rmse:1.048946 
## [31] train-rmse:0.981104 eval-rmse:1.049808 
## Stopping. Best iteration:
## [13] train-rmse:0.992072 eval-rmse:1.047983
# Train model for Word2Vec.
load_data('wv')
model = train_(train_args, params)
## [1]  train-rmse:2.539866 eval-rmse:2.656197 
## Multiple eval metrics are present. Will use eval_rmse for early stopping.
## Will train until eval_rmse hasn't improved in 25 rounds.
## 
## [11] train-rmse:0.996692 eval-rmse:1.050527 
## [21] train-rmse:0.986050 eval-rmse:1.047638 
## [31] train-rmse:0.980008 eval-rmse:1.049395 
## [41] train-rmse:0.975048 eval-rmse:1.050839 
## Stopping. Best iteration:
## [18] train-rmse:0.987941 eval-rmse:1.047256

It looks like label and one-hot encoding performed the best. They’re actually better than random forest and CatBoost right now, by default. I did an extensive trial run with XGBoost before but couldn’t get it right, so I guess I have to re-evaluate. We’ll use one-hot, since it’s more mathematically sound.
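For easier comparison, here are the best eval-rmse values from the four runs, transcribed from the logs above into a single data frame:

```r
# Best eval-rmse per feature set, transcribed from the training logs.
results = data.frame(
  set  = c("labels", "onehot", "pca", "wv"),
  rmse = c(1.044463, 1.045371, 1.047983, 1.047256)
)
results[order(results$rmse), ]  # labels and onehot lead, pca trails
```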

Analysis

Even though our best score for this model is comparatively horrid (it did worse than linear regression), we can still build a final model and examine, for instance, which features it sees as important.

model = readRDS('models/xgboost_final_1.8_1_0.1_15_0.1_11_0.95.rds')
y = getinfo(combined, 'label')
pred = trainfuncs$predict(model, newdata=combined)
score = analysis$rating_score(pred, y)
score
##    accuracy precision     recall    fscore
## 1 0.9395933 0.9044527 0.07516189 0.1387900
## 2 0.8995901 0.2371190 0.07592216 0.1150174
## 3 0.7235222 0.2877678 0.38119552 0.3279576
## 4 0.5316087 0.3974652 0.74498694 0.5183698
## 5 0.7267871 0.8084740 0.23844544 0.3682746
library(ranger)
rf = readRDS('models/ranger_final_40_306_respec.rds')
business = trainfuncs$read.csv('business_cats.csv')
combined = rbind(train, val)
combined = trainfuncs$join_sets(combined, split=TRUE)
pred = trainfuncs$predict(rf, newdata=combined$X, 
                          pred_obj="predictions")
rf_score = analysis$rating_score(pred, combined$y)
rf_score
##    accuracy precision     recall    fscore
## 1 0.9406767 0.9892183 0.08487512 0.1563365
## 2 0.8952119 0.2639740 0.12262562 0.1674599
## 3 0.7710387 0.3960056 0.55933877 0.4637098
## 4 0.6675037 0.5051141 0.85327490 0.6345770
## 5 0.7907133 0.9687711 0.38579970 0.5518377
analysis$score_dist(score, rf_score)
##       accuracy   precision       recall      fscore
## 1 -0.001083326 -0.08476564 -0.009713228 -0.01754649
## 2  0.004378236 -0.02685498 -0.046703456 -0.05244256
## 3 -0.047516587 -0.10823780 -0.178143248 -0.13575224
## 4 -0.135895042 -0.10764895 -0.108287961 -0.11620714
## 5 -0.063926234 -0.16029714 -0.147354260 -0.18356304

And this highlights a serious flaw in our error analysis methodology. If the predictions are closer to the actual scores in most instances, but round to the wrong number more often, it is possible to have a better RMSE but worse accuracy, precision, and recall. In other words, our predictions for each rating have higher variance but also higher kurtosis (picture a normal curve with longer tails but a higher middle).
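A toy example makes the rounding effect concrete. In this hypothetical, predictions `b` are uniformly closer to the truth (lower RMSE) than `a`, yet every one of them rounds to the wrong star:

```r
# b beats a on RMSE but loses badly on rounded accuracy.
y = c(4, 4, 4, 4, 4)
a = c(4.4, 4.4, 4.4, 4.4, 2.0)  # mostly rounds correctly, one big miss
b = c(4.6, 4.6, 4.6, 4.6, 4.6)  # always close, but always rounds to 5
rmse = function(p) sqrt(mean((p - y)^2))
acc  = function(p) mean(round(p) == y)
c(rmse(a), rmse(b))  # ~0.96 vs 0.60: b looks better by RMSE
c(acc(a), acc(b))    # 0.8 vs 0.0: a is better by accuracy
```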

Therefore, we’ll update our error analysis suite next.